In this notebook we'll analyze some of Joyce's wordplay in Ulysses, using more complicated regular expressions.
To tokenize the chapter and throw out the punctuation, we can use the regular expression \w+. Note that this will split up contractions like "can't" into ["can","t"].
%matplotlib inline
import nltk, re, io
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pylab import *
txtfile = 'txt/08lestrygonians.txt'
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
with io.open(txtfile) as f:
    tokens = tokenizer.tokenize(f.read())
print tokens[1000:1020]
print tokenizer.tokenize("can't keep a contraction together!")
The first method for searching for regular expressions in a set of tokens is the TokenSearcher object. It is fed a regular expression in which each <...> group matches a single token, so one pattern can span several tokens at once. This provides a big advantage: we don't have to break our tokens into n-grams manually, we can just let the TokenSearcher do the hard work.
Here's an example of how to create and call that object:
# find runs of five tokens in which the first and third tokens start with 's'
tsearch = nltk.TokenSearcher(tokens)
s_s_ = tsearch.findall(r'<s.*> <.*> <s.*> <.*> <.*>')
print len(s_s_)
for s in s_s_:
    print ' '.join(s)
Another way of searching for patterns, useful when the criteria would be hard to express as a regular expression (such as finding two adjacent words of the same length; there's a quick sketch of that below), is to assemble all of the tokens into bigrams.
Suppose we are looking for two words that start with the same letter. We can do this by iterating through a set of bigrams (we'll use a built-in NLTK function to generate them) and applying our search criteria to the first and second words independently.
To create bigrams, we'll use the nltk.bigrams() function, feeding it a list of tokens. When we do this, we can see there's a lot of alliteration in this chapter.
def printlist(the_list):
    for item in the_list:
        print item

alliteration = []
for (i,j) in nltk.bigrams(tokens):
    if i[:1]==j[:1]:
        alliteration.append( ' '.join([i,j]) )
print "Found",len(alliteration),"pairs of words starting with the same letter:"
printlist(alliteration[:10])
printlist(alliteration[-10:])
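As a quick aside, the "two adjacent words of the same length" criterion mentioned above would be awkward as a pure regular expression, but it drops right into the same bigram loop. This is just a sketch (the six-letter minimum is an arbitrary cutoff to keep the output interesting, not part of the original analysis):
# sketch: find adjacent words of the same length (at least six letters each)
same_length = []
for (i,j) in nltk.bigrams(tokens):
    if len(i)==len(j) and len(i)>=6:
        same_length.append( ' '.join([i,j]) )
print "Found",len(same_length),"pairs of adjacent words with matching lengths of six or more letters:"
printlist(same_length[:10])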
# look for pairs of adjacent words where one contains 'll' and the other contains 'l'
lolly = []
for (i,j) in nltk.bigrams(tokens):
    if len( re.findall('ll',i) )>0:
        if len( re.findall('l',j) )>0:
            lolly.append( ' '.join([i,j]) )
    elif len( re.findall('ll',j) )>0:
        if len( re.findall('l',i) )>0:
            lolly.append( ' '.join([i,j]) )
print "Found",len(lolly),"pairs of words, one containing 'll' and the other containing 'l':"
print "First 25:"
printlist(lolly[:25])
# same idea, but with 'rr' and 'r'
lolly = []
for (i,j) in nltk.bigrams(tokens):
    if len( re.findall('rr',i) )>0:
        if len( re.findall('r',j) )>0:
            lolly.append( ' '.join([i,j]) )
    elif len( re.findall('rr',j) )>0:
        if len( re.findall('r',i) )>0:
            lolly.append( ' '.join([i,j]) )
print "Found",len(lolly),"pairs of words, one containing 'rr' and the other containing 'r':"
printlist(lolly)
We can wrap this search for a shared double-letter/single-letter pattern in a function that takes the letter as an argument, i.e., dropping currants (the letter r).
def double_letter_alliteration(c, tokens):
    """
    Find all bigrams in which one word contains the double letter cc
    and the neighboring word contains the single letter c.
    This function is called by all_double_letter_alliteration().
    """
    allall = []
    for (i,j) in nltk.bigrams(tokens):
        if len( re.findall(c+c,i) )>0:
            if len( re.findall(c,j) )>0:
                allall.append( ' '.join([i,j]) )
        elif len( re.findall(c+c,j) )>0:
            if len( re.findall(c,i) )>0:
                allall.append( ' '.join([i,j]) )
    return allall
Now we can use this function to search for the single-double letter pattern individually, or we can define a function that will loop over all 26 letters to find all matching patterns.
printlist(double_letter_alliteration('r',tokens))
printlist(double_letter_alliteration('o',tokens))
import string

def all_double_letter_alliteration(tokens):
    all_all = []
    alphabet = list(string.ascii_lowercase)
    for aleph in alphabet:
        results = double_letter_alliteration(aleph, tokens)
        print "Matching",aleph,":",len(results)
        all_all += results
    return all_all
allall = all_double_letter_alliteration(tokens)
print len(allall)
That's a mouthful of alliteration! We can compare the number of word pairs matched by this one alliteration search to the total number of words in the chapter:
float(len(allall))/len(tokens)
Holy cow - 2.6% of the chapter is just spent on this one alliteration pattern.
print len(allall)
printlist(allall[:20])
Let's take a look at some acronyms. For this application it might be better to tokenize by sentence, then extract an acronym from each sentence.
with io.open(txtfile) as f:
    sentences = nltk.sent_tokenize(f.read())
print len(sentences)
acronyms = []
for s in sentences:
    s2 = re.sub('\n',' ',s)
    words = s2.split(" ")
    # take the first letter of each (non-empty) word in the sentence
    acronym = ''.join(w[0] for w in words if w != u'')
    acronyms.append(acronym)
print len(acronyms)
print "-"*20
printlist(acronyms[:10])
print "-"*20
printlist(sentences[:10]) # <-- contains newlines, but removed to create acronyms
from nltk.corpus import words
acronyms[101:111]
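The words corpus imported above suggests a natural follow-up: checking which of these accidental acronyms are themselves real English words. Here is a minimal sketch of that check (the lowercasing, the set lookup, and the two-letter minimum are assumptions of mine, not part of the original analysis):
# build a lookup set of English words from the NLTK words corpus
english_vocab = set(w.lower() for w in words.words())

# keep acronyms of two or more letters that appear in the word list
word_acronyms = [a for a in acronyms if len(a) > 1 and a.lower() in english_vocab]
print len(word_acronyms)
printlist(word_acronyms[:10])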